Extracting Role Titles

Our goal is to extract role information from job ads so that we can understand the ads better. This is a fairly complex task: role titles are buried in the text, and can be very ambiguous (“Manager”) or very specific (“Subsea Cabling Engineer”). This notebook scopes out the problem and looks at extracting common examples of role titles.

import pandas as pd
from pathlib import PosixPath

Load in the Data

Get the data from the Adzuna Job Salary Prediction Kaggle competition, put it in the data subfolder and unzip all the files.

You can do this manually, or use the Kaggle API (once you’ve installed the API, downloaded your kaggle.json file and agreed to the competition rules).

# for split, ext in [('Test', 'zip'), ('Train', 'zip'), ('Valid', 'csv')]:
#     !kaggle competitions download -c job-salary-prediction --path data/ -f {split}_rev1.{ext}
    
# !find data/ -name '*.zip' -execdir unzip '{}' ';'
# !find data/ -name '*.zip' -exec rm '{}' ';'

# !ls data/
%%time
dfs = []
for split in ['Train', 'Valid', 'Test']:
    dfs.append(pd.read_csv(f'data/{split}_rev1.csv').assign(split=split))
df = pd.concat(dfs, sort=False, ignore_index=True)
df['Title'] = df['Title'].fillna('')
del dfs
CPU times: user 6.52 s, sys: 3.06 s, total: 9.58 s
Wall time: 12.6 s
len(df)
407894
pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 100

The ad titles contain several different kinds of information:

  • Roles: “Engineering Systems Analyst”, “Stress engineer”, “Subsea cables engineer”
  • Location like “Glasgow” or “East Midlands”
  • Seniority like “Senior”, “Principal”, “Lead”, or “Trainee”
  • Industry like “Pharmaceutical” or “Construction”
  • Selling points/working conditions of the job: “Award Winning Restaurant”, “Excellent Tips”, “Self Employed”, “does it get any better than this?”
  • Company names: “Nevill Crest and Gun”, “The Refectory”

Sometimes there are multiple roles (often multiple descriptions of the same role):

  • Engineering Systems Analyst / Mathematical Modeller
  • Electrical / ICA Engineer

Sometimes it’s ambiguous: is “Modelling and simulation analyst” one role or two (“modelling analyst” and “simulation analyst”)? The same question applies to “C/C++ developer”. Is “Bilingual Reservationist” a role title, or is the role just “Reservationist” and “Bilingual” a skill required for the job?
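
A rough way to surface the multi-role cases is to split on slashes. The following is a minimal sketch, assuming that a slash with whitespace on both sides separates alternatives while an unspaced slash (as in “C/C++”) does not; `split_roles` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import re


def split_roles(title):
    """Split a title on slashes that have whitespace on both sides."""
    parts = re.split(r'\s+/\s+', title)
    return [part.strip() for part in parts if part.strip()]


print(split_roles('Engineering Systems Analyst / Mathematical Modeller'))
print(split_roles('Electrical / ICA Engineer'))
print(split_roles('C/C++ Developer'))
```

The second call returns “Electrical” and “ICA Engineer”, which shows the limit of this heuristic: it cannot tell that “Engineer” is shared by both alternatives.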

To understand the job we’ll also need to understand some of the acronyms like:

  • MICE Sales: Meetings, incentives, conferences and exhibitions
  • ICA Engineer: Instrumentation Control and Automation
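
One possible way to handle acronyms is a small lookup table applied to upper-case tokens. This is only a sketch: the `ACRONYMS` table and the `expand_acronyms` helper are hypothetical, and the table holds just the two expansions listed above.

```python
import re

# Hypothetical lookup table; only the two acronyms discussed above.
ACRONYMS = {
    'MICE': 'Meetings, incentives, conferences and exhibitions',
    'ICA': 'Instrumentation Control and Automation',
}


def expand_acronyms(title, acronyms=ACRONYMS):
    """Append the expansion in brackets after each known upper-case token."""
    def repl(match):
        token = match.group(0)
        if token in acronyms:
            return f'{token} ({acronyms[token]})'
        return token  # unknown upper-case words are left untouched

    return re.sub(r'\b[A-Z]{2,}\b', repl, title)


print(expand_acronyms('Electrical / ICA Engineer'))
```

Titles written entirely in capitals pass through unchanged unless one of their words happens to be in the table, so a real table would need care with short common words.
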
df.Title.head(50).reset_index()
index Title
0 0 Engineering Systems Analyst
1 1 Stress Engineer Glasgow
2 2 Modelling and simulation analyst
3 3 Engineering Systems Analyst / Mathematical Modeller
4 4 Pioneer, Miser Engineering Systems Analyst
5 5 Engineering Systems Analyst Water Industry
6 6 Senior Subsea Pipeline Integrity Engineer
7 7 RECRUITMENT CONSULTANT INDUSTRIAL / COMMERCIAL / ENGINEERING / DRIV
8 8 RECRUITMENT CONSULTANT CONSTRUCTION / TECHNICAL / TRADES LABOUR
9 9 Subsea Cables Engineer
10 10 Trainee Mortgage Advisor East Midlands
11 11 PROJECT ENGINEER, PHARMACEUTICAL
12 12 Principal Composite Stress Engineer
13 13 Senior Fatigue Damage Tolerance Engineer
14 14 Chef de Partie Award Winning Restaurant Excellent Tips
15 15 Quality Engineer
16 16 Principal Controls Engineer
17 17 Chef de Partie Award Winning Dining Live In Share of Tips
18 18 Senior Fatigue and Damage Tolerance Engineer
19 19 C I Design Engineer
20 20 Lead Engineers (Stress)
21 21 Relief Chef de Partie Croydon, Surrey Live in
22 22 Senior Control and Instrumentation Engineer
23 23 Control and Instrumentation Engineer
24 24 Electrical / ICA Engineer
25 25 Pastry Chef for **** red star **** rosette hotel ****
26 26 Senior Process Engineer
27 27 CHEF DE PARTIE POSITION IN **** ROSETTE HOTEL NYORKS ****k
28 28 Senior Sous Chef for **** rosette kitchen, up to ****
29 29 General Manager Funky, Cool Restaurant Concept London ****k
30 30 MICE Sales and Marketing Manager
31 31 C/C++ Developer
32 32 Senior PHP Developer
33 33 Senior Website Designer
34 34 Business Development Manager
35 35 Welwyn Chef de Partie does it get any better than this? ****
36 36 Chef de Partie Sauce Award Winning Hertford ****
37 37 Pastry Chef AL**** ****AA Rosette Restaurant
38 38 QA Engineer
39 39 Documentation Engineer
40 40 Bilingual Customer Service Operator
41 41 Customer Event Coordinator (German speaking)
42 42 Senior Planner
43 43 Bilingual Reservationist (Customer Service)
44 44 Trampoline Coach Bushey Grove Leisure Centre
45 45 Self Employed Swimming Instructors
46 46 Self Employed Sport Coaches
47 47 Bar/Waiting Staff The Cricketers, Sarratt
48 48 Deputy Manager Nevill Crest and Gun, Eridge Green
49 49 Bar/Waiting Staff The Refectory, Godalming

Let’s look at the most frequent titles. If different companies use the same title, it is much less likely to contain ad-specific details (like location, company info, or benefits).

titles = (
df
 .groupby('Title')
 .agg(companies=('Company', 'nunique'), jobs=('Id', 'count'))
 .sort_values(['companies', 'jobs'], ascending=False)
)
len(titles)
196165

Only about 19% of the ad titles occur in more than 1 company.

(titles['companies'] > 1).mean()
0.1913440216144572

About 10% of the ad titles occur in 0 companies. This is likely because the Company field is empty for those ads and pandas read it in as NA, which nunique does not count. This is small enough that we can ignore it for this purpose.

(titles['companies'] == 0).mean()
0.10200086661738843
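
A count of 0 is possible because `nunique` skips missing values by default. The toy frame below is made up for illustration (it is not the Adzuna data); it uses the same named aggregation as above, plus an extra hypothetical `missing_company` column to make the cause visible.

```python
import numpy as np
import pandas as pd

# Made-up rows: both 'Nurse' ads have no company recorded.
toy = pd.DataFrame({
    'Id': [1, 2, 3, 4],
    'Title': ['Chef', 'Chef', 'Nurse', 'Nurse'],
    'Company': ['A Ltd', 'B Ltd', np.nan, np.nan],
})

summary = toy.groupby('Title').agg(
    companies=('Company', 'nunique'),  # NaN is not counted as a value
    jobs=('Id', 'count'),
    missing_company=('Company', lambda s: s.isna().sum()),
)
print(summary)
```

On the real data, comparing `missing_company` with `jobs` for the titles where `companies == 0` would be one way to confirm the explanation.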

Even with a cutoff of 2 companies there are still some odd titles here.

titles[titles.companies == 2]
companies jobs
Title
Assistant Sales Manager Market Leading Retailer 2 66
Vehicle Purchaser / Car Sales 2 55
AREA RELIEF OFFICER 2 53
Vehicle Technician MOT Tester 2 42
Staff Nurse (RGN) Nursing Home 2 33
... ... ...
warehouse assistant 2 2
warehouse operatives 2 2
web designer 2 2
yEAR ****/4 TEACHER CARLTON **** PER DAY 2 2
zSeries Specialist zSeries UK Wide 2 2

25416 rows × 2 columns

One reason is the same job can come through two different job boards (SourceName), and they may have different ways of representing the company name or have errors obtaining it.

For example, the Company value “hyphen” looks like a mistake here.

df[df.Title.str.startswith('zS')]
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName split
49509 68626801 zSeries Specialist zSeries UK Wide zSeries Technical Specialist required for London, My high profile client (leading financial bran... London London NaN permanent Spring Technology IT Jobs 32000.00 - 42000.00 GBP Annual 37000.0 jobserve.com Train
63044 68702465 zSeries Specialist zSeries UK Wide zSeries Technical Specialist required for London , My high profile client (leading financial bra... City London South East London NaN permanent hyphen IT Jobs 32000 - 42000 per annum 37000.0 totaljobs.com Train

Here the company for the second job is ‘UKStaffsearch’, which is the name of the job board. The job board must replace the company name with its own.

Note that one is from the Train set and one from the Test set! This is a data leak.

df[df.Title.str.startswith('yEA')]
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName split
140597 70577243 yEAR ****/4 TEACHER CARLTON **** PER DAY Year ****/4 Teacher required for Mapperley Area TeacherActive are currently recruiting for a Pri... Nottingham, Nottinghamshire, England, West Yorkshire UK NaN contract TeacherActive Teaching Jobs 93 - 140/day 27960.0 cv-library.co.uk Train
377237 71623608 yEAR ****/4 TEACHER CARLTON **** PER DAY Year ****/4 Teacher required for Mapperley Area TeacherActive are currently recruiting for a Pri... Nottinghamshire - Nottingham Nottingham full_time permanent UKStaffsearch HR & Recruitment Jobs NaN NaN ukstaffsearch.com Test
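
The leak noted above could be checked more systematically by looking for ads that appear in more than one split. The sketch below uses made-up rows and assumes that an identical Title and FullDescription means the same ad; on the real data this exact-match key is likely to miss duplicates whose text differs slightly between job boards.

```python
import pandas as pd

# Made-up rows standing in for the combined Train/Valid/Test frame.
toy = pd.DataFrame({
    'Title': ['Year 4 Teacher', 'Year 4 Teacher', 'QA Engineer'],
    'FullDescription': [
        'Teacher required for Mapperley Area',
        'Teacher required for Mapperley Area',
        'QA Engineer needed for a software team',
    ],
    'split': ['Train', 'Test', 'Train'],
})

# Assumption: same title and same description text means the same ad.
key = ['Title', 'FullDescription']
n_splits = toy.groupby(key)['split'].transform('nunique')
leaked = toy[n_splits > 1]
print(leaked)
```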

Notice the double space in the job title.

These are all posted by the same company in multiple locations, but totaljobs.com has the company name as ‘Triple S Recruitment’ and cv-library.co.uk has it as ‘Triple S Recruitment Ltd’.

df[df.Title == ('Assistant Sales Manager  Market Leading Retailer')].sort_values('Company')
Id Title FullDescription LocationRaw LocationNormalized ContractType ContractTime Company Category SalaryRaw SalaryNormalized SourceName split
30332 68062445 Assistant Sales Manager Market Leading Retailer This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... Bolton Lancashire North West Bolton Le Sands NaN permanent Triple S Recruitment Sales Jobs OTE 35-45k plus benefits 40000.0 totaljobs.com Train
227637 72444806 Assistant Sales Manager Market Leading Retailer This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... Colne, Lancashire Lancashire North West Colne NaN permanent Triple S Recruitment Sales Jobs OTE 25- 30k plus benefits 27500.0 totaljobs.com Train
230526 72452426 Assistant Sales Manager Market Leading Retailer This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... Stirling Stirlingshire Scotland UK NaN permanent Triple S Recruitment Sales Jobs OTE 30-35k plus benefits 32500.0 totaljobs.com Train
230527 72452429 Assistant Sales Manager Market Leading Retailer This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... Brentford Middlesex South East UK NaN permanent Triple S Recruitment Sales Jobs OTE 35-40k plus benefits 37500.0 totaljobs.com Train
230936 72454431 Assistant Sales Manager Market Leading Retailer This leading UK retailer has enjoyed over 40 years of success and is a market leader in their fi... Dundee Angus Scotland UK NaN permanent Triple S Recruitment Sales Jobs OTE 35-45k plus benefits 40000.0 totaljobs.com Train
... ... ... ... ... ... ... ... ... ... ... ... ... ...
206431 72120567 Assistant Sales Manager Market Leading Retailer The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... Cambridge, Cambridgeshire Cambridge NaN permanent Triple S Recruitment Ltd Retail Jobs 15000 - 35000/annum OTE 30-35k plus benefits 25000.0 cv-library.co.uk Train
206432 72120572 Assistant Sales Manager Market Leading Retailer The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... Llandudno, Wales Llandudno NaN permanent Triple S Recruitment Ltd Retail Jobs 15000 - 35000/annum OTE 30-35k plus benefits 25000.0 cv-library.co.uk Train
279043 72120569 Assistant Sales Manager Market Leading Retailer The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... Cannock, Staffordshire Cannock NaN permanent Triple S Recruitment Ltd Retail Jobs NaN NaN cv-library.co.uk Valid
388642 72120555 Assistant Sales Manager Market Leading Retailer The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... Stockton on Tees, North East Stockton-On-Tees NaN permanent Triple S Recruitment Ltd Retail Jobs NaN NaN cv-library.co.uk Test
206426 72120557 Assistant Sales Manager Market Leading Retailer The future of our client and all of their staff couldn t be brighter, or more exciting. As Brita... Stirling, Scotland Stirling NaN permanent Triple S Recruitment Ltd Retail Jobs 15000 - 35000/annum OTE 30-35k plus benefits 25000.0 cv-library.co.uk Train

66 rows × 13 columns
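
Variants like these could be merged by normalising company names before counting them. This is a minimal sketch: `normalise_company` and the `SUFFIXES` set are hypothetical, and the choice of suffixes to drop is an assumption rather than something derived from the data.

```python
import re

# Hypothetical set of legal suffixes to drop; extend as needed.
SUFFIXES = {'ltd', 'limited', 'llp', 'plc'}


def normalise_company(name):
    """Lower-case, replace punctuation with spaces, collapse whitespace,
    and drop trailing legal suffixes."""
    tokens = re.sub(r'[^\w\s]', ' ', name.lower()).split()
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    return ' '.join(tokens)


print(normalise_company('Triple S Recruitment'))
print(normalise_company('Triple S Recruitment Ltd'))
```

Both calls give the same string, so counting unique normalised names would treat these ads as one company. It would not help when a job board substitutes its own name for the company.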

titles[titles.companies == 8]
companies jobs
Title
GRADUATE SALES EXECUTIVE / GRADUATE ACCOUNT MANAGER 8 110
Account Manager / Sales Executive 8 58
Relief Support Worker 8 41
LGV CE Driver 8 39
English Teaching Assistant 8 36
... ... ...
Senior Data Analyst 8 8
Senior Electrical Estimator 8 8
Syndicate Accountant 8 8
Telephone Researcher 8 8
Web Content Editor 8 8

288 rows × 2 columns

Even at 8 companies we still get some false positives.

These are all the same job ad!

df[df.Title == 'GRADUATE SALES EXECUTIVE / GRADUATE ACCOUNT MANAGER'].Company.value_counts()
BMS Sales Specialists LLP              27
BMS   Graduate                         16
BMS Graduates                          15
London4Jobs                             5
BMS GROUP                               4
BMS Sales and Marketing Specialists     4
UKStaffsearch                           2
BMS Graduate Recruitment                1
Name: Company, dtype: int64

We’ll start the cutoff at 10; the data is reasonably clean there, and it captures a little under 1% of the distinct titles.

(titles.companies >= 10).mean(), (titles.companies >= 10).sum()
(0.008212474192643949, 1611)
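
To compare candidate cutoffs side by side, the same calculation can be run in a loop. The sketch below uses a made-up `titles_toy` frame in place of `titles`, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Made-up company counts standing in for the `titles` frame.
titles_toy = pd.DataFrame(
    {'companies': [0, 1, 1, 2, 2, 8, 10, 25]},
    index=[f'title {i}' for i in range(8)],
)

rows = []
for cutoff in [2, 8, 10]:
    kept = titles_toy['companies'] >= cutoff
    rows.append({
        'cutoff': cutoff,
        'titles': int(kept.sum()),  # number of titles at or above the cutoff
        'share': kept.mean(),       # fraction of all distinct titles
    })
sweep = pd.DataFrame(rows).set_index('cutoff')
print(sweep)
```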

Output into a CSV for further analysis in a spreadsheet program.

!mkdir -p output
titles[titles.companies >= 10].to_csv('output/common_titles.csv')